initial_reshape_function <- function(df,s){
  df <- subset(df, df$Grade != "All Grades")
  df$Grade <- as.integer(df$Grade)
  df$Mean.Scale.Score <- as.numeric(df$Mean.Scale.Score)
  df$Level1Percentage <- as.numeric(df$Level1Percentage)
  df$Level2Percentage <- as.numeric(df$Level2Percentage)
  df$Level3Percentage <- as.numeric(df$Level3Percentage)
  df$Level4Percentage <- as.numeric(df$Level4Percentage)
  df$Subject <- s
  df$Category <- factor(df$Category)
  df$Subject <- factor(df$Subject)
  return(df)
}

Analyzing Test Data on a Borough-level Granulatrity 2013-2015

setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df <- read.csv("BoroughELAResults20132015_ALL.csv")
df <- initial_reshape_function(df,"ELA")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df1 <- read.csv("BoroughELAResults20132015_Gender.csv")
df1 <- initial_reshape_function(df1, "ELA")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df2 <- read.csv("BoroughELAResults20132015_Ethnicity.csv")
df2 <- initial_reshape_function(df2, "ELA")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df3 <- read.csv("BoroughELAResults20132015_ELL.csv")
df3 <- initial_reshape_function(df3, "ELA")
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ela.2013.2015 <- bind_rows(df,df1)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
ela.2013.2015 <- bind_rows(ela.2013.2015,df2)
ela.2013.2015 <- bind_rows(ela.2013.2015,df3)
rm(df, df1, df2, df3)
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df <- read.csv("BoroughMathResults20132015_ALL.csv")
df <- initial_reshape_function(df, "Math")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df1 <- read.csv("BoroughMathResults20132015_ELL.csv")
df1 <- initial_reshape_function(df1, "Math")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df2 <- read.csv("BoroughMathResults20132015_Ethnicity.csv")
df2 <- initial_reshape_function(df2, "Math")
setwd('C:\\Users\\ransf\\Documents\\BoroughData/')
df3 <- read.csv("BoroughMathResults20132015_Gender.csv")
df3 <- initial_reshape_function(df3, "Math")
math.2013.2015 <- bind_rows(df,df1)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
math.2013.2015 <- bind_rows(math.2013.2015,df2)
math.2013.2015 <- bind_rows(math.2013.2015,df3)

borough_data <- bind_rows(ela.2013.2015,math.2013.2015)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
rm(ela.2013.2015,math.2013.2015)
rm(df, df1, df2, df3)

Let’s analyze how the mean test scores look across each Borough from 2013-2015.

ggplot(subset(borough_data,Category == "All Students" & Subject == "ELA"), aes(Borough,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in ELA")

ggplot(subset(borough_data,Category == "All Students" & Subject == "Math"), aes(Borough,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in Math")

It looks like Queens has the highest Median mean scores across all the boroughs. Now we want to analyze scores for each subject

From the plots, The Bronx has significantly lower test scores then the other boroughs. Staten Island has the highest median mean score in ELA while Queens has the highest median mean score in Mathematics. We now have a look at the scores by Gender across all boroughs.

ggplot(subset(borough_data,Category == c("Female", "Male")), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students")

ggplot(subset(borough_data,Category == c("Female", "Male") & Subject == "ELA"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in ELA")

ggplot(subset(borough_data,Category == c("Female", "Male") & Subject == "Math"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in Math")

The plots show that Females outperform the males in both Math and ELA. The difference in median scores between males and females is significant. Now let’s look at the breakdown by race across all boroughs.

ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian") & Subject == "ELA"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in ELA")

ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian") & Subject == "Math"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in Math")

The data shows that Asian students outperform all other races in all categories. The median score for Asian students is at least 10 points higher than all other races. It seems that Black and Hispanic students are having significantly lower test scores than other races. Next we will look at the breakdown based on English speaking students and English Language learning students.

ggplot(subset(borough_data,Category %in% c("ELL", "Former ELL", "EP") ), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All Students")

ggplot(subset(borough_data,Category %in% c("ELL", "Former ELL", "EP") & Subject == "ELA"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in ELA")

ggplot(subset(borough_data,Category %in% c("ELL", "Former ELL", "EP") & Subject == "Math"), aes(Category,Mean.Scale.Score)) + geom_boxplot() +  ylab("Mean Score for All students in Math")

The data shows that English proficient students had the highest median scores which is expected. Given the data, it seems that the barrier of not speaking English is reflective on the test scores outcome. Former English Language Learners seem to do almost as good as English Proficient students.

ggplot(borough_data, aes(Borough,Level1Percentage)) + geom_boxplot() +  ylab("Percentage of Level 1 Students")

ggplot(borough_data, aes(Borough,Level2Percentage)) + geom_boxplot() + ylab("Percentage of Level 2 Students")

ggplot(borough_data, aes(Borough,Level3Percentage)) + geom_boxplot() + ylab("Percentage of Level 3 Students")

ggplot(borough_data, aes(Borough,Level4Percentage)) + geom_boxplot() + ylab("Percentage of Level 4 Students")

I found it interesting that although Queens has relatively the same proportion of Level 2,3, and 4 students, they had the highest median of mean value scores across all the boroughs.

ggplot(subset(borough_data,Category == c("Female", "Male")), aes(Category,Level1Percentage)) + geom_boxplot() +  ylab("Percentage of Level 1 Students")

ggplot(subset(borough_data,Category == c("Female", "Male")), aes(Category,Level2Percentage)) + geom_boxplot() + ylab("Percentage of Level 2 Students")

 ggplot(subset(borough_data,Category == c("Female", "Male")), aes(Category,Level3Percentage)) + geom_boxplot() + ylab("Percentage of Level 3 Students")

ggplot(subset(borough_data,Category == c("Female", "Male")), aes(Category,Level4Percentage)) + geom_boxplot() + ylab("Percentage of Level 4 Students")

It seems that Females had a lower proportion of Level 1 students and a higher proportion of level 2, 3, and 4 students than males.

ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian")), aes(Category,Level1Percentage)) + geom_boxplot() +  ylab(" Percentage of Level 1 Students")

 ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian")), aes(Category,Level2Percentage)) + geom_boxplot() +  ylab(" Percentage of Level 2 Students")

 ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian")), aes(Category,Level3Percentage)) + geom_boxplot() +  ylab(" Percentage of Level 3 Students")

ggplot(subset(borough_data,Category %in% c("Black", "White", "Hispanic", "Asian")), aes(Category,Level4Percentage)) + geom_boxplot() +  ylab(" Percentage of Level 4 Students")

Black students had very large number of Level 1 and 2 students with almost 50% of them Level 1. Asian students had the lowest median proportion of level 1 and 2 students and the highest median proportion of level 3 and 4 students. White students had a lower median proportion of level 1 and 2 students in respect to Black and Hispanic students, but had a relatively equal proportion of Level 2 students as well.

It looks like females’ mean score begin to overtake males’ mean score around 303. Males peak at around 297 while females peak around 307. Females minimum mean score seems to fall around 283. Whites and Asians’ mean scores begin to become a higher proportion of their students around 310 where it begins to peak and slowly decreases as the mean score amount get increases. A large proportion of Black and Hispanics’ mean scores fall between 280 and 300. Black Students’ mean score maxes out around 295. Hispanic Students’ mean score maxes out around 305. EP students seem to peak around the same mean score as females do. ELL students are somewhat evenly spread between the mean scores’ 250 and 290. There seems to be no representation of ELL students past 292.

Analyzing Test Data on a School-level Granulatrity 2013-2015

## A separate function for school data was created because the column formats are different 
##than the column formats in the borough data.  In the school data we need to convert the 
##Percentage columns from "factors" to "characters" first and then convert them to numerics. ##This was done because when converting from "factors" to "numerics", we do not get the same ##initial decimal value.  All observations that have 5 or less students tested have "s" 
##values for each column value so we filter those out.
initial_reshape_school_data <- function(df,s){
  df <- subset(df, df$Grade != "All Grades")
  df$Grade <- as.character(df$Grade)
  df$Grade <- as.integer(df$Grade)
  df <- subset(df, Number.Tested > 5)
  df$Number.Tested <- as.integer(df$Number.Tested)
  df$Level1Percentage <- as.character(df$Level1Percentage)
  df$Level2Percentage <- as.character(df$Level2Percentage)
  df$Level3Percentage <- as.character(df$Level3Percentage)
  df$Level4Percentage <- as.character(df$Level4Percentage)
  
  df$Mean.Scale.Score <- as.numeric(df$Mean.Scale.Score)
  df$Level1Percentage <- as.numeric(df$Level1Percentage)
  df$Level2Percentage <- as.numeric(df$Level2Percentage)
  df$Level3Percentage <- as.numeric(df$Level3Percentage)
  df$Level4Percentage <- as.numeric(df$Level4Percentage)
  df$Subject <- s
  return(df)
}
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df <- read.csv("SchoolMathResults20132015_ALL.csv")
df <- initial_reshape_school_data(df,"Math")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df1 <- read.csv("SchoolMathResults20132015_ELL.csv")
df1 <- initial_reshape_school_data(df1,"Math")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df2 <- read.csv("SchoolMathResults20132015_Ethnicity.csv")
df2 <- initial_reshape_school_data(df2,"Math")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df3 <- read.csv("SchoolMathResults20132015_Gender.csv")
df3 <- initial_reshape_school_data(df3,"Math")
math.2013.2015.school <- bind_rows(df,df1)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
math.2013.2015.school <- bind_rows(math.2013.2015.school,df2)
math.2013.2015.school <- bind_rows(math.2013.2015.school,df3)

rm(df,df1,df2,df3)
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df <- read.csv("SchoolELAResults20132015_ALL.csv")
df <- initial_reshape_school_data(df,"ELA")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df1 <- read.csv("SchoolELAResults20132015_ELL.csv")
df1 <- initial_reshape_school_data(df1,"ELA")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df2 <- read.csv("SchoolELAResults20132015_Ethnicity.csv")
df2 <- initial_reshape_school_data(df2,"ELA")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df3 <- read.csv("SchoolELAResults20132015_Gender.csv")
df3 <- initial_reshape_school_data(df3,"ELA")
ela.2013.2015.school <- bind_rows(df,df1)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
ela.2013.2015.school <- bind_rows(ela.2013.2015.school,df2)
ela.2013.2015.school <- bind_rows(ela.2013.2015.school,df3)

school_data <- bind_rows(math.2013.2015.school,ela.2013.2015.school)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
rm(math.2013.2015.school,ela.2013.2015.school)

I thought that it would be good to add a field for which borough the school resides in so I wrote a function to perform that action. This is done based on the letter in the DBN number for the school: “M” is for Manhattan, “K” is for Brooklyn, “X” is for the Bronx, “Q” is for Queens, and “R” is for Staten Island.

## We extract the Borough information by analyzing the DBN for each school.
## If the DBN number has a specific character, then we assign it the appropriate
## Borough
create_borough_vector <- function (v){
  
  cond <- grepl(".+M.+",v$DBN,perl = TRUE)
  v$Borough <- ifelse(cond,"Manhattan","Other Borough")
  cond <- grepl(".+X.+",v$DBN,perl = TRUE)
  v$Borough <- ifelse(cond,"Bronx",v$Borough)
   cond <- grepl(".+K.+",v$DBN,perl = TRUE)
  v$Borough <- ifelse(cond,"Brooklyn",v$Borough)
   cond <- grepl(".+Q.+",v$DBN,perl = TRUE)
  v$Borough <- ifelse(cond,"Queens",v$Borough)
   cond <- grepl(".+R.+",v$DBN,perl = TRUE)
  v$Borough <- ifelse(cond,"Staten Island",v$Borough)
  return(v)
}
school_data <- create_borough_vector(school_data)

Since the percentage of level 1,2,3,4 students is the only continuous value field that is in this dataset, I thought it would be fitting for Bivariate Analysis. The first plot is the Mean Score vs. Level1Percentage.

ggplot(school_data, aes(x = Level1Percentage, y = Mean.Scale.Score)) + geom_point()

Since there is a lot of over plotting, we use the alpha parameter to reduce it.

ggplot(school_data, aes(x = Level1Percentage, y = Mean.Scale.Score)) + geom_point(alpha = 1/25)

ggplot(school_data, aes(x = Level2Percentage, y = Mean.Scale.Score)) + geom_point(alpha = 1/25)

ggplot(school_data, aes(x = Level3Percentage, y = Mean.Scale.Score)) + geom_point(alpha = 1/25)

ggplot(school_data, aes(x = Level4Percentage, y = Mean.Scale.Score)) + geom_point(alpha = 1/25)

Looking at the Level 1 Percentage plot, we see some interesting trends at the extremes (0 percent and 100 percent). Schools that tested 0 percent of Level 1 students had a mean score range from 300 to 375 whereas schools that tested 100 percent of Level 1 students had a mean score range from 225 to 275. Most schools have Level 1 students within the range of 0 and 70 percent. There seems to be a strong linear relationship with the number of Level 1 students tested and mean score for the schools in the NYC area.

The plot for the Mean Score vs. Level 2 percentage is pretty interesting. It generates an arrow-shaped plot. There is huge variation in the range of Mean scores for schools who tested students with a Level 2 Percentage of 25 or less. After the 25 percent threshold, the mean scores for schools seems to converge to 300.

The plot for the Mean Score vs. Level 3 percentage shows a closer linear relationship. Most schools tested students with Level 3 percentages in the range of 3 to about 35 percent. The mean scores within this range fall mostly between 275 and 325. After the 35 percent threshold, the mean scores continue to rise but the number of schools with higher percentages of Level 3 students slowly diminishes.

Most schools have students with Level 4 between 1 and 25 percent. Even at low number of Level 4 percentages, most mean scores are above 300. We see that the number of Level 4 students slowly diminishes after 25 percent but the mean scores steadily increases reaching mean scores as high as 380.

Now that we have a good field for the data for the past two years, lets wrangle the older data and combine it with the most recent data to see if we can see any trends. While looking at the data from 2006-2012, I noticed that the mean scores were higher in the older data than in the 2013-2015 Data. After doing some searching on the web, I found this site. The documents here show that the score scales vary year over year, so using the mean scale score will not be very useful when combining data from 2006-2012 with data from 2013-2015.

Analyzing Test Data on a School-level Granulatrity 2006-2015

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df1 <- read.csv("SchoolELAResults20062012Public_ALL.csv")
df1 <- initial_reshape_school_data(df1, "ELA")
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df2 <- read.csv("SchoolELAResults20062012Public_ELL.csv")
df2 <- initial_reshape_school_data(df2, "ELA")

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df3 <- read.csv("SchoolELAResults20062012Public_Gender.csv")
df3 <- initial_reshape_school_data(df3, "ELA")

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df4 <- read.csv("SchoolELAResults20062012Public_Ethnicity.csv")
df4 <- initial_reshape_school_data(df4, "ELA")

ela.2006.2012.data <- bind_rows(df1,df2)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
ela.2006.2012.data <- bind_rows(ela.2006.2012.data,df3)
ela.2006.2012.data <- bind_rows(ela.2006.2012.data,df4)
setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df1 <- read.csv("SchoolMathResults20062012Public_ALL.csv")
df1 <- initial_reshape_school_data(df1, "Math")

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df2 <- read.csv("SchoolMathResults20062012Public_ELL.csv")
df2 <- initial_reshape_school_data(df2, "Math")

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df3 <- read.csv("SchoolMathResults20062012Public_Ethnicity.csv")
df3 <- initial_reshape_school_data(df3, "Math")

setwd('C:\\Users\\ransf\\Documents\\SchoolData')
df4 <- read.csv("SchoolMathResults20062012Public_Gender.csv")
df4 <- initial_reshape_school_data(df4, "Math")

math.2006.2012.data <- bind_rows(df1,df2)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
math.2006.2012.data <- bind_rows(math.2006.2012.data, df3)
math.2006.2012.data <- bind_rows(math.2006.2012.data, df4)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
school_data.2006.2012 <- bind_rows(ela.2006.2012.data,math.2006.2012.data)

school_data <- bind_rows(school_data,school_data.2006.2012)
school_data <- create_borough_vector(school_data)

After combining all the data from 2006-2015, we have over 630,000 observations. First we examine the Level1Percentage vs. Number.Tested with aesthetic coloring of the Boroughs.

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point(aes(color = Borough))

Looking at the graph, I notice a large percentage of observations for the Bronx and Brooklyn residing between 30 and 80 percent. These observations also seem to fall less than 200 students tested. Observations for Queens and Staten Island seem to fall below 25 percent when the number of students range from 100 to 500. It is tough to analyze observations for Manhattan in this plot. To get a better view, we will generate a plot with only observations for Manhattan

ggplot(subset(school_data, Borough == "Manhattan" ), aes(x = Number.Tested, y = Level1Percentage)) + geom_point(aes(color = Borough))

The plot for Manhattan looks roughly the same as the original: high percentages of Level 1 students for those tested below 200 and lower percentages for those tested more than 200. Let’s do a facet wrap on the data based on Borough to see if this trend is across all boroughs.

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point() + facet_wrap(~Borough)

It seems that across all boroughs, the percentage of Level 1 students decrease after the number of students tested passes a specific threshold. My suspicion is that the number of Level 1 students remains constant, but since the population size of students tested increases, they become a smaller percentage. Let’s see if the data supports this hypothesis by comparing the Level1Count vs Number.Tested faceted over Boroughs

ggplot(school_data, aes(x = Number.Tested, y = as.integer(Level1Count))) + geom_point() + facet_wrap(~Borough)

My hypothesis is somewhat true, but we can see in Brooklyn and Queens that the number of Level 1 students decrease as the number of students tested increases. Let’s see how the same plot faceted over years looks like:

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point() + facet_wrap(~Year)

The trend seems to be the same from 2006-2012 with a consistent decrease in Level 1 students’ percentage from 2006-2009 for population sizes less than 200. From 2013 to 2015, there are larger percentages of Level 1 students across a wider range of student populations. This could be due to the change in NY Testing to be more Common Core aligned in 2013 as mentioned in the Notes of the dataset.

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point(aes(color=Borough)) + facet_wrap(~Grade)

ggplot(school_data, aes(x = Number.Tested, y = Level2Percentage)) + geom_point(aes(color=Borough)) + facet_wrap(~Grade)

ggplot(school_data,aes(x = Number.Tested, y = Level3Percentage)) + geom_point(aes(color=Borough)) + facet_wrap(~Grade)

ggplot(school_data,aes(x = Number.Tested, y = Level4Percentage)) + geom_point(aes(color=Borough)) + facet_wrap(~Grade)

For Level 1 students, there seems to be a higher concentration of Bronx schools with a high percentage level 1 students in Grades 4 and 5, but seems to diminish in grades 7 and 8. In grades 3-5, the distribution seems fairly uniform across the number.tested. In contrast, for grades 6-8, there seems to be a decreasing trend in percentage of level 1 students as the number of tested increases. For Level 2 students, there seems to be large proportion around the 50 percent mark across all grades. For Level 3 students, Brooklyn is neatly condensed between ranges 25-75% for grades 3-5 and has a more scattered distribution in the higher grades. You start to see more Queens and Staten Island observation points in the higher percentages in this plot. Brooklyn has a noticeable concentration of points in the higher percentage end in grades 6-8 and the Bronx is mostly concentrated on the low end of the percentage scale across all grades. In Level 4 students, we see a lot more points representing Manhattan. There is a good concentration of Manhattan schools at the higher percentage end for all grades. Staten Island has dense concentration around 50-75% in grades 3-5 and more scattered distribution on the low percentage side in the higher grades. Brooklyn has a high percentage of Level 4 students in grades 6-8 and Queens has high percentage of level 4 students across all grades.

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Year)

ggplot(school_data,aes(x = Number.Tested, y = Level2Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Year)

ggplot(school_data, aes(x = Number.Tested, y = Level3Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Year)

ggplot(school_data, aes(x = Number.Tested, y = Level4Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Year)

2009 seems to be the best year for testing as there was a high concentration of low percentage of level 1 students, and a high concentration of high percentage of level 3 students. We can see there was a big disruption in the distribution of test scores from 2013 when NYC decided to align more with the Common Core Standards. 8th graders seem to be having a difficult time with the standardized test as they dominate a large proportion in level 1 and 2 students. 6th and 7th graders seem to be doing the best over the years as they have large concentration of Level 3 and 4 students. 3rd and 6th graders seem to have weathered the change in Test standards in 2013. There is a dense concentration of Level 4 students in grades 3-5 from 2006-2009 but seems to dwindle after that.

ggplot(school_data, aes(x = Number.Tested, y = Level1Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Subject)

ggplot(school_data, aes(x = Number.Tested, y = Level2Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Subject)

ggplot(school_data, aes(x = Number.Tested, y = Level3Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Subject)

ggplot(school_data, aes(x = Number.Tested, y = Level4Percentage)) + geom_point(aes(color=factor(Grade))) + facet_wrap(~Subject)

Performance seems to be good from grades 3-7, but falls drastically in 8th grade. Grades 4-6 seems to be the top performers in ELA. 8th graders seem to have a large concentration of Level 1 and 2 students in both subjects.

school_data_group <- school_data %>% group_by(DBN,Year,Grade) %>% summarise(Level1MeanPercent = mean(Level1Percentage), Level2MeanPercent = mean(Level2Percentage),Level3MeanPercent = mean(Level3Percentage),Level4MeanPercent = mean(Level4Percentage), n = n())

school_data_group <- create_borough_vector(school_data_group)                                                                 

I have created another dataset which has the average percentages across all levels of students. Let’s analyze this dataset to see if it gives us any additional insight.

ggplot(school_data_group, aes(x = Level1MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Year)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level2MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Year)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level3MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Year)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level4MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Year)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The first thing I notice is how the Level 1 and 4 student data is skewed and the Level 2 and 3 student data is fairly normalized. The average percent of Level 1 students seems to be similarly distributed across all boroughs, but there is more distinction in the Level 2 data. Queens and Staten Island’s mean are before 25 percent from 2006-2009, but align with the other boroughs’ distributions from 2010 onwards. The Bronx’s normal distribution slowly shifts right each year up until 2013, then the mean value shifts back left and stays there from 2013-2015. Manhattan’s mean oscillates left from 2006-2009 then to the right from 2010-2012 and stabilizes from 2013-2015.

In the Level 3 Data, we can see why 2009 seems to be the best year across them all as there are a large count of schools that have average percent of greater than 50% across all boroughs. The trend shows that 2006-2009 had high average percentages of Level 3 students, but that trend decreases each year afterwards.

The average percentage of Level 4 students is below 25% for all boroughs across all years.

ggplot(school_data_group, aes(x = Level1MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Grade)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level2MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Grade)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level3MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Grade)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(school_data_group, aes(x = Level4MeanPercent)) + geom_freqpoly(aes(color = Borough)) + facet_wrap(~Grade)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

3-5th graders had a larger number of schools with low percentage averages for Level 1 students and Level 2 students (< 50%). 3-5th graders had a high number of high percentage averages of Level 3 students. The number of Level 4 students with averages percentages greater than 25% are larger in 3rd-5th graders than 6-8th graders.

6-8th grade students had a large number of schools with average percentages around 50% for Level 2 students and average percentages around 20-35% for Level 3 students.

ggplot(school_data_group, aes(y = Level1MeanPercent, x = Borough)) + geom_boxplot()

ggplot(school_data_group, aes(y = Level2MeanPercent, x = Borough)) + geom_boxplot()

ggplot(school_data_group, aes(y = Level3MeanPercent, x = Borough)) + geom_boxplot()

ggplot(school_data_group, aes(y = Level4MeanPercent, x = Borough)) + geom_boxplot()

The median value for average mean percentages are roughly equivalent for Level 1, 2, and 3 students for all boroughs. The median value for average mean percentages for Level 4 students are equivalent for Bronx, Brooklyn, and Manhattan, whereas the median value for average mean percentages for Level 4 students are roughly equal for Queens and Staten Island.

ggplot(school_data_group, aes(y = Level1MeanPercent, x = factor(Year))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level2MeanPercent, x = factor(Year))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level3MeanPercent, x = factor(Year))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level4MeanPercent, x = factor(Year))) + geom_boxplot()

These plot support the claim that 2009 was the best year amongst them all showing lower median average percentages for Level 1 and 2 students and higher median average percentages for Level 3 and 4 students. Level 1 and 3 plots show the most drastic changes across the years. In the Level 1 plot, the median values decrease over 2006-2009, then increased and remain constant from 2010-2012 and then increases sharply and remains constant from 2013-2015. The inverse trend is shown for Level 3 students.

ggplot(school_data_group, aes(y = Level1MeanPercent, x = factor(Grade))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level2MeanPercent, x = factor(Grade))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level3MeanPercent, x = factor(Grade))) + geom_boxplot()

ggplot(school_data_group, aes(y = Level4MeanPercent, x = factor(Grade))) + geom_boxplot()

The median value across grades is roughly equivalent across all grades with a slight lower value for grades 3-5 for Level 1 and 2 students and slightly higher value for grades 3-5 for Level 3 and 4 students.

Final Plots and Summary

Plot One

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Description One

It is shown that 2009 had the highest median percentage value of proficient students over all the years where the IQR for 2009 falls between 63% and 89%. From 2006-2009 the median percentage of proficient students lies above 50% and falls below 50% from 2010-2011. After the sharp decline in 2013, the median percentage value reaches 25% in 2015. From 2013-2015, 75% of the schools have less than a 50% proficient students according to the standards.

Plot Two

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Description Two

The trends for Black and Hispanic students are roughly the same. The trends for White and Asian students also follow the same pattern. If we look at the trend lines for Black and Hispanic students vs. White and Asian Students, we can see the interception point occurring around the 80% proficient mark.

Plot Three

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Description Three

The IQR for Manhattan falls almost exactly between 25 and 75% giving it the largest IQR across all boroughs. 75% of the Bronx schools have less than a 55% proficient rate for its students.

Reflection

Analyzing the NYC Test data gave me quite some insight about their education system. Given that my fiancée and my 11 year old stepson reside in the Bronx, I had a lot of interest in finding out how well the test scores were in the Bronx borough. Although I was not pleased with my findings, I’m was not surprised. The Bronx is one of the most impoverished and economically challenged boroughs in New York City. I ran into a roadblock when I wanted to try to correlate the Borough Test Data with the budget funding for each school to see if there was any relationship with lower test scores in lower funded schools. There is no dataset provided and no API to access and retrieve this information. To retrieve this information, I would have to perform some Screen Scraping tasks to generate this info using the way the HTML accesses this information. To do this for tens of thousands of schools didn’t seem like the best thing to do given the time constraint. I also ran into a roadblock when I noticed that the data type for the percentages in the data from 2006-2012 was different than the percentages in 2012-2013. Thus I had to write separate functions to wrangle the datasets before combining them. My final roadblock was when I discovered that the Mean Scores varied over years, thus it would not be a reliable metric to measure when analyzing data over the years. This is why I stop analyzing it once I combined the earlier data with the 2013-2015 dataset. I did have success with the many ways the data was broken down (Gender, Race, Borough, English Learner, Grade, etc.) which allowed me to look at the data through different categories. For future work, I would like to look at the top 10 performing schools and the bottom 10 performing schools and see if there is any insight that I can find as to why the top 10 schools performs so well and the bottom 10 are not doing so well. This could be analyzed on Borough-by-borough basis or grade level basis. Also, there seem to be a breakdown on test scores once students reached 8th grade that would need some further investigation as well. Finally, I would like to get more insight as to why Black and Hispanic students are not performing as well as their White and Asian student counterparts. This is very important to me as an African-American male who faced the same disconnect when growing up in the city of Atlanta, Georgia.